imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

grondo · 2024-10-30T15:56:59Z

This PR updates the IMP to translate SIGUSR1 to SIGKILL for flux-imp exec as defined in RFC 15.

Additionally, if possible, the IMP will signal all members of its cgroup (besides itself) when delivering SIGKILL.

Finally, to better handle returning the proper wait status to its parent, when the child of the IMP is terminated by a signal, the IMP raises that same signal to itself.

This is WIP pending real system testing.

grondo · 2024-10-30T20:38:34Z

Hm, readthedocs build seems to have been broken at some point as well. Will have to look into that.

garlick

Great!

I have some comments but all superficial and/or optional.

src/imp/cgroup.c

src/imp/signals.c

src/test/docker/docker-run-checks.sh

grondo · 2024-10-31T19:27:04Z

As part of this work I switched the cgroup controller that the IMP checks for from name=systemd to pids, since the latter can actually be used in testing. I had thought this was working, but after further investigation, that isn't quite right and the pids controller is unchanged for processes in a job managed by sdexec.

I think the correct thing (after perusing cgroups(7)) is to use the 0:: line in /proc/self/cgroup, which I think corresponds to the cgroups v2 hierarchy.

I'll have to modify this PR to make that change.

Problem: Signal forwarding is currently implemented directly in the IMP exec subcommand which makes it impossible to reuse the code in other IMP subcommands. Move signal handling to imp/signals.[ch] for better code reuse. Pass the imp state object into the signal handler for possible future use, otherwise behavior of the IMP exec command should be identical to before this change.

Problem: To comply with RFC 15, the IMP needs to know attributes of its current cgroup, but this information is not currently available. Add a cgroup_info structure to the imp_state object, and populate it at IMP startup. This moves and modifies some internal code from pidinfo.c. In the future, this duplicate code in pidinfo.c will be removed.

grondo · 2024-11-01T00:26:39Z

Ok, I think I've fixed the cgroup discovery here. The code is somewhat simplified, at startup the IMP:

checks if either /sys/fs/cgroup or /sys/fs/cgroup/unified are mounted as cgroup2
if not then check for a legacy (cgroup v1) mount at /sys/fs/cgroup/systemd
grab the current relative cgroup path for either case from /proc/self/cgroup (logic to find the correct one is now fixed)

I also had to add a workaround to remove leading /.. from the relative cgroup path, which seems to pop up when run within a container (the relative path is relative to the container mount point on the host I think)

Problem: RFC 15 specifies that the IMP should handle SIGUSR1 as a surrogate for SIGKILL, but this is not currently supported. Add cgroup_kill() which will send a signal to all member's of the IMP's cgroup (besides the IMP's own pid). Add SIGUSR1 to the set of signals handled by the signal forwarding implementation of the IMP. In the case of SIGUSR1, deliver SIGKILL to all members of the current cgroup if use_cgroup_kill is true, otherwise fallback to pid_kill_children().

Problem: The lingering IMP exits with 127+signo if the invoked child was terminated by a signal during `flux-imp exec`, but ideally the IMP would exit with the exact wait status of the child. Add a convenience function, `imp_raise()`, which raises a signal after setting the handler to the default disposition for the signal in question. Use imp_raise() to deliver the same signal which terminated the child in `flux-imp exec` if the child is detected to have exited due to a signal.

Problem: Some tests in the flux-security testsuite may need write access to cgroups, but docker-run-checks.sh explicitly mounts cgroups as read-only. Mount cgroups read-write for testing purposes.

Problem: The IMP exec tests that use sign-none.toml and sudo are unnecessarily complicated because the IMP has to be run in a subshell, sometimes in the background. This not only duplicates a lot of code, but makes it impossible to get the exit code of the IMP when it is run in the background. Add a function to run `flux-imp exec` under sudo with the sign-none.toml config. While we're at it, create the `sleeper.sh` helper at the beginning of the tests, instead of as part of another test. Update callers to use `imp_exec_sign_none()` where possible.

Problem: There are no tests the ensure sending SIGUSR1 to the IMP causes it to send SIGKILL to its children. Add a couple tests that ensure SIGKILL is sent when the IMP is running both in and out of an "imp-shell" cgroup.

Problem: pstree(1) is not available on all test platforms. Remove its use in debugging tests.

Problem: The readthedocs build fails with sphinx.errors.VersionRequirementError: 5.0 Update doc/requirements.txt to match those in flux-core, including an updated sphinx version requirement.

Problem: Some distros don't seem to have CGROUP2_SUPER_MAGIC defined in linux/magic.h. Copy the magic value from a system that has it. Remove the ifndef CGROUP2_SUPER_MAGIC block since it will now always be defined.

Problem: Relative paths for cgroups in /proc/self/cgroup may contain leading `/..` when the IMP is run inside some containers. This is because the path is relative to the container's mount point, not the actual mount point of the cgroupfs. Just strip the leading /.. from paths during discovery of the current cgroup path from `/proc/self/cgroup`. This should only apply in containers and thus CI.

garlick · 2024-11-01T15:49:53Z

I grabbed your snapshot from last night and put it on my test cluster. This morning I made some minimal changes to job-exec on my issue#6406 branch to send SIGUSR1 instead of imp-kill when shell termination methods fail. Then I ran a test on my system by starting a long running job, sending the flux-shell a SIGSTOP, then canceling the job. Looks good! Flux on rank 0 logged:

[  +5.002092] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +10.005491] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +15.005766] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +20.005115] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +25.005382] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +25.005415] job-exec[0]: Sending SIGUSR1 to IMP for job ƒ3L9MFPYEKZ
[ +25.169307] job-manager[0]: housekeeping: ƒ3L9MFPYEKZ started
[ +25.201629] job-list[0]: purged 1 inactive jobs
[ +25.203373] job-manager[0]: purged 1 inactive jobs
[Nov01 08:43] job-manager[0]: housekeeping: ƒ3L9MFPYEKZ complete

On the compute node, systemd logged

Nov 01 08:41:05 picl1 systemd[796]: Starting imp-shell-1-f3L9MFPYEKZ.service - User workload...
Nov 01 08:41:05 picl1 systemd[796]: Started imp-shell-1-f3L9MFPYEKZ.service - User workload.
Nov 01 08:42:32 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Sent signal SIGTERM to main process 907 (flux-imp) on client request.
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Sent signal SIGUSR1 to main process 907 (flux-imp) on client request.
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Main process exited, code=killed, status=9/KILL
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Failed with result 'signal'.

grondo · 2024-11-01T16:19:12Z

This morning I added one fix here: cgroup_kill() was always sending SIGKILL instead of the sig parameter. Fixed.

garlick

LGTM!

grondo · 2024-11-01T16:42:26Z

Great, thanks! I'll set MWP. I'll post a PR to have the IMP wait until its cgroup is empty in a bit.

grondo force-pushed the issue#185 branch from dd31be6 to 942c6df Compare October 30, 2024 16:09

garlick mentioned this pull request Oct 31, 2024

sdexec: systemd tries to kill guest tasks without sufficient permission flux-framework/flux-core#6399

Closed

grondo force-pushed the issue#185 branch from 0eb507e to 2664bd0 Compare October 31, 2024 18:22

grondo changed the title ~~WIP: imp: exec: send SIGKILL to cgroup/children on SIGUSR1~~ imp: exec: send SIGKILL to cgroup/children on SIGUSR1 Oct 31, 2024

garlick previously approved these changes Oct 31, 2024

View reviewed changes

grondo dismissed garlick’s stale review via fcfabd5 October 31, 2024 22:58

grondo force-pushed the issue#185 branch 2 times, most recently from 22462f7 to 4668b2c Compare November 1, 2024 00:11

grondo force-pushed the issue#185 branch from 4668b2c to d67ea07 Compare November 1, 2024 00:21

grondo added 9 commits November 1, 2024 15:15

docker: mount cgroups rw

87f2c7f

Problem: Some tests in the flux-security testsuite may need write access to cgroups, but docker-run-checks.sh explicitly mounts cgroups as read-only. Mount cgroups read-write for testing purposes.

testsuite: test IMP SIGUSR1 handling

5bb34b8

Problem: There are no tests the ensure sending SIGUSR1 to the IMP causes it to send SIGKILL to its children. Add a couple tests that ensure SIGKILL is sent when the IMP is running both in and out of an "imp-shell" cgroup.

testsuite: remove calls to pstree(1)

7e1093e

Problem: pstree(1) is not available on all test platforms. Remove its use in debugging tests.

doc: update readthedocs requirements

21aebcc

Problem: The readthedocs build fails with sphinx.errors.VersionRequirementError: 5.0 Update doc/requirements.txt to match those in flux-core, including an updated sphinx version requirement.

imp: cgroup: copy definition of CGROUP2_SUPER_MAGIC

1b38809

Problem: Some distros don't seem to have CGROUP2_SUPER_MAGIC defined in linux/magic.h. Copy the magic value from a system that has it. Remove the ifndef CGROUP2_SUPER_MAGIC block since it will now always be defined.

grondo force-pushed the issue#185 branch from d67ea07 to 35ec95d Compare November 1, 2024 15:15

garlick approved these changes Nov 1, 2024

View reviewed changes

grondo added the merge-when-passing label Nov 1, 2024

mergify bot merged commit 1d7cf73 into flux-framework:master Nov 1, 2024
17 checks passed

grondo deleted the issue#185 branch November 1, 2024 16:44

garlick mentioned this pull request Nov 5, 2024

imp: implement signal forwarding specification from RFC 15 #185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

grondo commented Oct 30, 2024

grondo commented Oct 30, 2024

garlick left a comment

grondo commented Oct 31, 2024

grondo commented Nov 1, 2024

garlick commented Nov 1, 2024

grondo commented Nov 1, 2024

garlick left a comment

grondo commented Nov 1, 2024

imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

Conversation

grondo commented Oct 30, 2024

grondo commented Oct 30, 2024

garlick left a comment

Choose a reason for hiding this comment

grondo commented Oct 31, 2024

grondo commented Nov 1, 2024

garlick commented Nov 1, 2024

grondo commented Nov 1, 2024

garlick left a comment

Choose a reason for hiding this comment

grondo commented Nov 1, 2024